DOCUMENT RESUME

ED 120 250                    95                    TM 005 212

AUTHOR          Nitko, Anthony J.
TITLE           Problems in the Development of Criterion-Referenced
                Tests: The IPI Pittsburgh Experience.
INSTITUTION     Pittsburgh Univ., Pa. Learning Research and
                Development Center.
SPONS AGENCY    National Inst. of Education (DHEW), Washington,
                D.C.
PUB DATE        74
NOTE            27p.; Footnotes may not reproduce clearly due to
                small type
EDRS PRICE      MF-$0.83 HC-$2.06 Postage
DESCRIPTORS     Achievement; Behavioral Objectives; Content Analysis;
                *Criterion Referenced Tests; Instructional Systems;
                Task Performance; *Test Construction; *Testing
                Problems



ABSTRACT

These four characteristics inherent in criterion-referenced tests form
the central theme of this paper: (1) The classes of behaviors that
define different achievement levels are specified as clearly as
possible before the test is constructed; (2) Each behavior class is
defined by a set of test situations (that is, test tasks) in which the
behaviors can be displayed in terms of all their important nuances;
(3) Given that the classes of behavior have been specified and that
the test situations have been defined, a representative sampling plan
is designed and used to select the test tasks that will appear on any
form of the test; (4) The obtained score must be capable of expressing
objectively and meaningfully the individual's performance
characteristics in these classes of behavior. The focus of this paper
is on the development of criterion-referenced tests having these four
properties and some associated technical problems that are
encountered. Solutions for these technical problems are not readily
available nor immediately generalizable to all curricular areas for
which criterion-referenced tests might be desired. Attempts are made,
therefore, to specify procedures that will be useful to the practical
developer until the technical problems are solved. (RC)









PROBLEMS IN THE DEVELOPMENT OF 
CRITERION-REFERENCED TESTS: 
THE IPI PITTSBURGH EXPERIENCE



Anthony J. Nitko



Learning Research and Development Center 
University of Pittsburgh 



Reprinted with permission from C. W. Harris, M. C. Alkin, and
W. J. Popham (Eds.), Problems in Criterion-Referenced Measurement.
(CSE Monograph Series in Evaluation, 3.) Los Angeles: University of
California, Center for the Study of Evaluation, 1974. Pp. 59-82.



Preparation of this paper was supported by the Learning Research and
Development Center, which is supported in part by funds from the National
Institute of Education (NIE), United States Department of Health,
Education, and Welfare. The opinions expressed do not necessarily reflect
the position or policy of NIE and no official endorsement should be
inferred.






Criterion-referenced testing has come about on a new wave of psychology,
a psychology expressing an increasing concern for instruction and
the instructional process. Such an instructional psychology postulates a
theory of instruction that is prescriptive with respect to the instructional
procedure itself. A learning theory, on the other hand, is descriptive and
after the fact, specifying the conditions under which the learning occurred
(Bruner, 1966).

In theories of instructional psychology primary focus is on . . . (a) a
description of the state of knowledge to be achieved; (b) description
of the initial state with which one (the learner) begins; (c) actions
which can be taken, or conditions that can be implemented to transform
the initial state; (d) assessment of the transformation of the state that
results from each action; and (e) evaluation of the attainment of the
terminal state desired (Glaser & Resnick, 1972, p. 208).

Glaser's motive for applying criterion-referenced testing to educational
achievement measurement (Glaser, 1963) stemmed from a concern about
the kind of achievement information needed to make instructional decisions
from the above kind of instructional psychology. Some instructional
decisions concern individuals and may relate, for example, to the
competence an individual needs in order for him to be successful in the
next course of a sequence. Other decisions center around the adequacy
of the instructional procedure itself. Tests that provide achievement
information about an individual only in terms of how the individual compared
with other members of the group tested, or which provide only sketchy
information about the degree of competence the individual possesses with
respect to some desired educational outcome, are not sufficient to make
the kinds of decisions necessary for effective instructional design and
guidance.

Glaser's (1963) application combined both the notion of a desired model
or a minimum goal we would like an individual to attain (Flanagan, 1951)
and the notion of a standard domain of content (Ebel, 1962). He called
for the specification of the type of behavior the individual is required
to demonstrate with respect to the content. This distinction between
behavior or performance and content is at the heart of criterion-referenced
testing. "The standard [or criterion] against which a student's performance
is compared . . . is the behavior which defines each point along the
achievement continuum (Glaser, 1963, p. 519)." A criterion-referenced test,
then, is one that is deliberately constructed to give scores that tell what
kinds of behavior individuals with those scores can demonstrate (Glaser
& Nitko, 1971).

Grateful acknowledgment is made to Drs. Cooley, Cox, Glaser, Hsu, and
Resnick for their helpful comments on the draft manuscript.

Note that this definition does not imply a predetermined, fixed
cutting-score (cf., e.g., Livingston, 1972); it does not imply simply
writing a set of behavioral objectives and keying a set of items to those
objectives; and it does not imply the use of only open-ended production
items (cf. Harris & Stewart, 1971). The definition, instead, implies that
there are four characteristics inherent in criterion-referenced tests:

1. The classes of behaviors that define different achievement levels are
specified as clearly as possible before the test is constructed.

2. Each behavior class is defined by a set of test situations (that is, test
tasks) in which the behaviors can be displayed in terms of all their
important nuances.

3. Given that the classes of behavior have been specified and that the
test situations have been defined, a representative sampling plan is
designed and used to select the test tasks that will appear on any
form of the test.

4. The obtained score must be capable of expressing objectively and
meaningfully the individual's performance characteristics in these
classes of behavior (Nitko, 1970).

These four characteristics form the central theme of this paper. The
focus is on the development of criterion-referenced tests having these
properties and some associated technical problems that are encountered.
Solutions for these technical problems are not readily available nor
immediately generalizable to all curricular areas for which
criterion-referenced tests might be desired. Attempts are made, therefore,
to specify procedures that will be useful to the practical developer until
the technical problems are solved.

The characteristics outlined above appear to form a logical developmental
sequence. This sequence is seldom followed in practice. In fact,
a great deal of criterion-referenced test development is still in the
intuitive or artistic state. More often than not the procedure is
iterative. For example, attempts to specify classes of behavior may begin
by first specifying varieties of test items. These items might be subjected
to behavioral analysis and behavioral class descriptions are then induced.
This may lead to further specification of items or redefinition of behavior
classes.

Permeating all of the discussion that follows is the notion of a theory
of performance (Miller, 1962; Hively, 1970) or an analysis of the
psychological processes underlying task performance. This type of process
analysis is used to structure the classes of behavior defining various
levels of achievement and in interpreting specific item performance as
representing the class of behavior defined.

DOMAIN DEFINITION 

Of the four characteristics of criterion-referenced testing² outlined
earlier, specifying classes of behavior that define different levels of
achievement is the most difficult to achieve. The failure to adequately
specify this domain has led to recent criticisms of criterion-referenced
testing (e.g., Ebel, 1970; Stanley & Hopkins, 1972). Since these criticisms
hark back to the inadequacy of the old percentage grading system, perhaps
the demise of that system was also due to the domain specification failure.

A complete exposition on domain specification is beyond the scope of
this paper. It is useful, however, to sketch out some of the dimensions
of the problem so that the practical developer of criterion-referenced tests
may take them into consideration. These dimensions include establishing
various levels of achievement, the relationship between ultimate and
proximate achievement levels, the nature of the domain specification, and
the derivation of domain descriptions.

Levels of Achievement

Performance or achievement criteria can be established at any convenient
point in the instructional process. For example, the classes of behavior
defining various levels of competence can be specified at the termination
of a course, at the termination of a unit of instruction (i.e., smaller
within-course segments of instruction), or at any other point during the
course of instruction. The definition of these behavior domains will be
guided by the nature of the instructional system and the purpose for which
the information will be used, e.g., certification of attainment,
within-curriculum placement, or diagnosis of deficiencies (cf. Glaser &
Nitko, 1971).

At the termination of instruction broad domains of performance are
definable. The definition and analysis of these domains occur at several
levels ranging from the definition of the desired outcomes of the entire
educational enterprise, at one extreme, to the specification of the desired
outcomes at the termination of a particular subject-matter course, at the
other extreme. The former is likely to yield many domain definitions,
be divergent, and require many tests in order to assess pupil outcomes.
The latter leads to fewer domains, is more convergent in terms of outcome
categories, and may result in fewer tests.

²While it may be useful for some to avoid the term criterion-referenced
testing and focus on criterion-referenced score interpretation (e.g.,
Simon, 1969; Davis, 1970), it seems more useful to refer to "tests" in the
context of this paper. In order to have criterion-referenced score
interpretations, scores need to be referenced back to the behavior domain.
Hence, focus in development should be primarily on the domain of behavior
and the derivation of test tasks to elicit that behavior, rather than
short-cutting these and focusing mainly on the scores (cf. Jackson, 1971).

Ultimate and Proximate Behavior

Defining levels of achievement at various points in instruction raises
the issue of what kinds of behavior are important enough to be included
in a domain specification. While this is an old area and subject to
considerable debate and discussion, it is not yet resolved. The importance
of the distinction between proximate and ultimate objectives of instruction
for educational test developers was articulated several years ago by
Lindquist (1951).

Educational practice generally assumes that the knowledge and capabilities
with which the learner leaves the classroom are related to the educational
goals envisioned by society. This assumption implies that the long-range
goals the learners are to attain in the future are known and that
the behaviors with which the learners leave a particular course actually
contribute to the attainment of these goals. What is closer to reality,
however, is that the long-term relationship between what the student is
taught and the way he is eventually required to behave in society is not
very clear (Glaser & Nitko, 1971).

In contrast to ultimate goals, proximate goals define the domains of
performance that a learner displays at the end of a particular instructional
situation (e.g., course or grade level). It should be noted that proximate
objectives are not defined as the materials of instruction nor as the
particular set of test items that have been used in the instructional
situation. For example, at the end of a course in spelling one might
reasonably expect a student to be able to spell certain classes of words
from dictation. During the course, certain of these words might have been
used as examples or as practice exercises. The instructor would be
interested in the student's performance with respect to the class or domain
of words as a proximate objective of instruction and not the particular
words used in instruction. Thus to assess a student's performance with
respect to a domain, one may need to consider the transfer relationship
between the items in the domain and the preceding instruction.

General Nature of Domain Specification

The specification of the domain of instructionally relevant achievement
behaviors can profit much from the suggestions for "universe specification"
advocated by Cronbach (1971). As Cronbach has pointed out, too often
attention is paid only to the selection of subject-matter topics. The nature
of the stimulus and the description of the response are ignored. Proper
domain specification requires that both stimulus and response descriptions
be included. Thus

A proper response specification deals with the result a person is asked
to produce, not the process(es) by which he succeeds or fails. 'Reads
printed words aloud' is a description of an observable response; it says
nothing about whether the reader is to look and say or to sound the
word out. A person who insisted on separating these two response
processes would have to devise a new task specification, perhaps requiring
the reading of nonsense constructions that no subject has seen before.
If a category of the form, say, 'ability to evaluate arguments' is to mean
anything as a topic specification, the designation must be fleshed out
to describe something about the stimulus, the accompanying injunction
to the subject, and the aspect of the behavior to which the scorer is
directed to attend (Cronbach, 1971, p. 454).

In this sense, use of the Taxonomy of Educational Objectives (Bloom,
1956) is insufficient for domain specification since the categories
described therein are inferred psychological processes. However, to
adequately specify the dimensions of the performances to be included in the
domain, one may need to invoke a theory of performance (Hively, 1970;
Miller, 1962) to decide which stimulus and response characteristics are
relevant for domain description. This point will arise again when deriving
tasks from the behavior description is discussed.

Derivation of Domain Description

While in practice the generation of performance domains is often
ultimately tied to the actual specification of the tasks (stimuli)
themselves, this derivational process is discussed separately here. It
should be noted, however, that the state of the technology for determining
the content and attributes of what is learned is not well developed,
particularly where the behavioral characteristics of complex school-like
performances are concerned (Glaser & Resnick, 1972).

One practical method for deriving domain descriptions for smaller classes
of behavior, such as a domain of behavior relevant for a unit³ of
instruction, is the procedure stemming out of Gagné's work on learning
hierarchies (e.g., Gagné & Paradise, 1961). (A modification of this
procedure, which seems to give more replicable results, has been provided
by Resnick (Resnick, Wang, & Kaplan, 1970).) The analysis of learning
hierarchies begins with any desired instructional objective, behaviorally
stated, and asks in effect: To perform this behavior what prerequisite or
component behaviors must the learner be able to perform? For each behavior
so identified, the same question is asked, thus generating an ordered
hierarchy of behaviors based on testable prerequisites. The analysis can
begin at any level and always specifies what comes earlier in the
curriculum. It should be noted that as it is used here, hierarchy analysis
is a tool for domain definition. Whether all students' learning should
progress through the hierarchy in the same way is an empirical question for
instructional psychology.

³The analysis of learning hierarchies need not be restricted to units of
instruction, of course. It may be possible to apply the procedure to broad
curricular areas.
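
The repeated question described above ("what prerequisite or component
behaviors must the learner be able to perform?") can be illustrated with a
short sketch. It is only an illustration under stated assumptions: the
objectives and the prerequisite map are hypothetical and are not taken from
the IPI curriculum, and the depth-first ordering is one convenient way to
list the behaviors, not the procedure of Gagné or Resnick.

```python
from typing import Dict, List

# Hypothetical prerequisite map (an assumption for illustration only):
# each behavior lists the component behaviors it depends on.
PREREQUISITES: Dict[str, List[str]] = {
    "add two-digit numbers with carrying": [
        "add two-digit numbers without carrying",
        "rename ten ones as one ten",
    ],
    "add two-digit numbers without carrying": ["add one-digit numbers"],
    "rename ten ones as one ten": ["count to 100"],
    "add one-digit numbers": ["count to 100"],
    "count to 100": [],
}


def build_hierarchy(objective: str,
                    prerequisites: Dict[str, List[str]]) -> List[str]:
    """Ask the prerequisite question for the objective, then for each
    behavior it names, and return the behaviors ordered so that every
    prerequisite appears before the behavior that depends on it."""
    ordered: List[str] = []
    seen = set()

    def visit(behavior: str) -> None:
        if behavior in seen:  # already analyzed; also guards against cycles
            return
        seen.add(behavior)
        for prereq in prerequisites.get(behavior, []):
            visit(prereq)
        ordered.append(behavior)  # listed only after its prerequisites

    visit(objective)
    return ordered


hierarchy = build_hierarchy("add two-digit numbers with carrying",
                            PREREQUISITES)
for level, behavior in enumerate(hierarchy, start=1):
    print(level, behavior)
```

The resulting list is a hypothesis about ordering, in the sense of the
text: it says what comes earlier in the curriculum, not how every student
must progress through it.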

As a result of this type of analysis and domain specification, the test
developer is provided with the essential information about what behaviors
are to be observed and tested in order to determine the status of the
learner with respect to the achievement continuum. Thus a hierarchical
analysis provides a good map on which the attainment, in performance
terms, of an individual student may be located. The uses of such hierarchies
in designing a testing program for a particular instructional system are
described elsewhere (Glaser & Nitko, 1971).

A serious question that can be raised is how much of education can
be analyzed into hierarchical structures. The answer to the question is
very much an open, experimental matter. Three things should be noted,
however (Glaser & Nitko, 1971). First, the development of hierarchies
for complex behaviors may lead to several such structures, each of which
is "valid" with different kinds of learners, but none of which, taken alone,
is valid for all learners. Second, the analysis of behaviors into components
and prerequisites leads to structures that stand as hypotheses open to
empirical verification. Third, in actual instructional practice there is
always a functional sequence wherein the instructor has at least an
intuitive hierarchy through which he proceeds.

Another point to remember is that criterion-referenced interpretations
are most useful when the behavior domain has an orderly progression
(Cronbach, 1970). Hierarchy analysis, or a similar procedure, would seem
to be a useful tool in discovering these progressions.

The use to which the test is to be put will, to a large extent, determine
the nature of the performance to be included in the domain definition.
For example, one may develop performance domains by analysis of an
"expert's" behavior or by the analysis of an "amateur's" behavior (Hively,
1970). It may well be that certain elements of performance will drop
out as task proficiency increases. For assessment of initial stages of
learning, therefore, it may be that more components need to be included in
the domain definition (and consequently on the test) than at later stages of
learning. This would seem to imply a distinction between diagnosis,
placement, and final (terminal) learning assessment (see Glaser & Nitko,
1971).

DEFINING CLASSES OF ITEMS 
Closely associated with the definition of behavior classes related to levels
of achievement is the translation of these behavioral statements into sets
of test situations: test tasks or test items. Although discussed here
separately, in practice these two steps are often iterative. Performance
domains tend to be verbal statements and descriptions (e.g., behavioral
objectives) whereas test situation descriptions tend to be more concrete in
that the characteristics of the testing situation and the various types of
admissible test items are mapped out and specified. Test items here refer to
any "carefully described . . . stimulus conditions under which a student is
expected to respond, together with the specifications for recording and
scoring his response when it occurs (Hively, 1970)." Items include both
performance and traditional paper-and-pencil types of items as long as
these are derived from the domain definitions.

Item Forms 

A useful tool for criterion-referenced tests is item forms analysis (Hively,
1966; Hively, Maxwell, Rabehl, Sension, & Lundin, 1973; Hively, Patterson,
& Page, 1968; Osburn, 1968). Item forms analysis is a variation on task
analysis. It is the process whereby behavioral statements are analyzed
in order to derive classes of items which elicit the various aspects of the
behavior class. As a result of this analysis, one or more item forms are
derived for each behavior class. An item form consists of a specification
of the invariant parts of the class of items together with (a) an indication
of which parts of the items are variable, (b) a specification of elements
which can be used in the variable parts of the items, and (c) a specification
of the rules by which one selects an element from the set of variable
elements to derive a particular item (Hively, 1970; Hively, et al., 1973).
The invariant part of the item is called a shell; the sets of elements which
can be used in the variable parts are called replacement sets; and the
rules by which one samples from the replacement sets are called the
replacement structure (Hively, 1970; Hively, et al., 1973).
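
The three parts of an item form named above (shell, replacement sets,
replacement structure) can be made concrete with a small sketch. Everything
specific in it is an assumption chosen for illustration: the addition shell,
the two-digit replacement sets, and the "carrying" rule are hypothetical and
do not reproduce any item form from Hively's work.

```python
import random

# Shell: the invariant part of every item in the class (hypothetical).
SHELL = "{a} + {b} = ___"

# Replacement sets: the elements allowed in each variable slot (hypothetical).
REPLACEMENT_SETS = {"a": range(10, 100), "b": range(10, 100)}


def satisfies_structure(a: int, b: int) -> bool:
    """Replacement structure (hypothetical rule): admit only pairs whose
    ones digits sum to 10 or more, so that the item requires carrying."""
    return (a % 10) + (b % 10) >= 10


def generate_item(rng: random.Random) -> dict:
    """Draw elements from the replacement sets until the replacement
    structure is satisfied, then fill the shell."""
    while True:
        a = rng.choice(REPLACEMENT_SETS["a"])
        b = rng.choice(REPLACEMENT_SETS["b"])
        if satisfies_structure(a, b):
            return {"stem": SHELL.format(a=a, b=b), "key": a + b}


rng = random.Random(1974)  # fixed seed so a test form can be regenerated
for _ in range(3):
    print(generate_item(rng))
```

Because the item form defines the whole class of admissible items, any
number of parallel items can be generated from it rather than written one
at a time.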

In practice, one often cannot go directly from a verbal statement of
a behavior class to an item form. The procedure usually is to first develop
prototype items admissible as test tasks under the described behavior.
Process and component analysis (cf. Resnick, Wang, & Kaplan, 1970) of
these prototype items often leads to a modification of the original behavior
specification, elimination of some of the prototype items as not implied
by the behavior class, or a rewriting of the prototype items. In examining
these prototype items to determine their fit to the behavioral definition
one invokes a behavioral analysis and a theory of performance. This process
involves more than superficial judgment and sorting. The questions that
need to be answered are: (1) Does this item contain the stimulus
characteristics implied by the behavioral statement? (2) Will the examinee's
response to this item be indicative that he indeed has the desired response
in his repertoire?

Once the set of prototype items has been delineated, item forms can
be induced. The prototype item is one member of the class of items implied
by an item form. The task here is to identify the general form (format)
of the items, the item shell, the variable elements, and the admissible
replacement sets. Again, this implies a behavioral analysis and a theory
of performance.

Item Tryout Data

As part of the procedure for defining test tasks that are consistent with
domain definitions, it is necessary to establish empirical procedures for
tryout of items. A major purpose of traditional item-tryout procedures
is to collect data necessary to improve the test items. This is no less true
when criterion-referenced test items are developed.

Tryout of items for criterion-referenced test development seeks to further
refine and polish the domain of test tasks. All the ambiguities that are
inherent in traditional item writing are inherent in criterion-referenced
item writing. Further, since item forms are developed using behavioral
analysis and performance theory, the data from item tryout are used to
check on the adequacy of this initial analysis. Often this will lead to a
respecification of the item form or one or more of its components:
replacement sets or replacement structure (cf. Osburn, 1968).

There are those who advocate either explicitly (e.g., Stenner & Webster,
1971) or implicitly (e.g., Baker, 1971) that items designed to test a specific
class of behaviors be homogeneous. Homogeneous tends to be defined
in terms of item and total test score parameters such as discrimination
indices and internal consistency reliability estimates. These
correlation-related indices tend to be maximized when each item measures the
same factor (process) (Lord, 1958). The insistence on homogeneity in this
sense is too sweeping and is poor psychology. It leads to statistical
techniques being used to drive the definition of performance domains. There
is no logical basis for contending a priori that any domain of performance
identified as instructionally relevant ought to be homogeneous (cf.
Cronbach, 1971). Homogeneity should be viewed as a question for empirical
experimentation and item performance theory (cf. Bormuth, 1970) and
would probably vary with the target population and the class of behaviors
under consideration. Heterogeneity would mean that a larger number of
observations is needed before adequate generalizations about domain
performance can be made.
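
To make "internal consistency reliability estimates" concrete, the sketch
below computes one familiar index of that kind, Kuder-Richardson formula 20,
from a matrix of right/wrong item scores. The choice of KR-20 is an
assumption made for illustration; the paper does not name a particular
estimate, and the tryout data shown are invented.

```python
from typing import List


def kr20(responses: List[List[int]]) -> float:
    """Kuder-Richardson formula 20 for dichotomous (0/1) item scores.

    responses[i][j] is examinee i's score on item j. Population variances
    (divide by n) are used for the total score so that they match the
    item variances p * (1 - p).
    """
    n = len(responses)
    k = len(responses[0])
    if n < 2 or k < 2:
        raise ValueError("need at least two examinees and two items")

    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("total scores do not vary; KR-20 is undefined")

    # Sum of item variances p_j * q_j.
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        sum_pq += p * (1 - p)

    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Invented tryout data: four examinees, three items.
ordered_pattern = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 1]]
mixed_pattern = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
print(kr20(ordered_pattern), kr20(mixed_pattern))
```

A low value for a set of items is, in the terms of the text, an empirical
finding about the domain and the population tested; it is not by itself a
reason to redefine an instructionally relevant domain.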

Hierarchy Validation

If hierarchy analysis is used to develop the test domain, empirical data
need to be collected to validate this structure in terms of the items
defining the various levels of the hierarchy. One should distinguish what
might be called the "psychometric" hierarchy⁴ from the learning hierarchy.
Classes of test tasks (items) can be ordered in hierarchical ways which
may bear little relationship to the sequence in which learning should
proceed. If the hierarchical ordering of the domain implies an instructional
sequence, or if it represents a hypothesis about behavioral acquisition
derived from instructional theory, then empirical transfer studies are
required as well. Thus, criterion-referenced testing is not exempt from
construct validation studies (cf. Cronbach, 1970).

⁴For an example of procedures to validate psychometric hierarchies, see
Resnick and [co-author illegible] (1971) and [author illegible] (1970).
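
One simple way to look at a "psychometric" hierarchy in tryout data is to
count examinees who pass a supposedly dependent item class while failing
its supposed prerequisite. The sketch below is only an illustrative index
under invented data; it is not the procedure of the studies cited in the
footnote, and it says nothing about the learning hierarchy, which the text
notes requires transfer studies.

```python
from typing import Dict, List


def reversal_rate(results: List[Dict[str, bool]],
                  prerequisite: str,
                  dependent: str) -> float:
    """Proportion of examinees who pass the dependent item class but fail
    the prerequisite class. A rate near zero is consistent with the
    hypothesized ordering; a high rate contradicts it."""
    if not results:
        raise ValueError("no examinee results supplied")
    reversals = sum(
        1 for r in results if r[dependent] and not r[prerequisite]
    )
    return reversals / len(results)


# Invented mastery decisions for four examinees on two item classes.
tryout = [
    {"one-digit addition": True, "two-digit addition": True},
    {"one-digit addition": True, "two-digit addition": False},
    {"one-digit addition": False, "two-digit addition": False},
    {"one-digit addition": False, "two-digit addition": True},
]
print(reversal_rate(tryout, "one-digit addition", "two-digit addition"))
```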

Item Performance and Instruction

An important consideration in the tryout of test items in this context
is the relationship between instruction and the test item domain. The
tryout data are dependent on: "(1) the characteristics of the item itself,
(2) the program of instruction with which it is associated, (3) the sample
of the students from whom the data were collected, and (4) the conditions
under which the students worked (Hively, 1970)." These are factors
which influence the interpretation of tryout data and the subsequent
decisions that are made concerning item and domain revision.

If the behavioral domain and subsequently derived item classes are based
on some inferred process (e.g., application in the Bloom Taxonomy) or
an inferred psychological construct (e.g., a hierarchy of prerequisite
behaviors), then the content and nature of the examinees' previous learning
history (i.e., instruction) need to be considered in interpreting tryout
data. A similar point is made by Bormuth (1970), who calls for the
development of procedures for relating the structure of the items to the
structure of the instruction. For example, to adequately derive classes of
test tasks measuring transfer, application, and evaluation behaviors it is
necessary to eliminate from the item form those items on which the students
were given practice, thus leaving those items that elicit responses not
explicitly taught, but which can be deduced from instruction. Without such
procedures, it is not possible to determine whether the classes of items are
indeed achievement items, as opposed to general knowledge or aptitude
items.

The development of items for criterion-referenced tests and the
associated empirical data generated by tryout and study of these classes
of items seem to call for aspects of achievement test theory that are as
yet not well developed. Bormuth labels these item-writing theory and
item-response theory. Item-writing theory would lead to the development
of procedures for defining classes of items (item forms) and item-response
theory would lead to explanations of the processes that account for
responses to classes of items. The developer of criterion-referenced tests
should refer to Bormuth's book for suggestions along these lines and for
indications of some of the problems involved in pursuing research in these
areas. It should be emphasized that theories and research in these areas
are currently inadequate or completely lacking.






SELECTING ITEMS TO APPEAR ON THE TEST

Once the behavior domain and the classes of items have been specified,
the final stages of test development can proceed. It might be argued that
the preceding discussion concerning domain definition is no more than
what any test developer should do in order to maximize content validity,
regardless of whether a criterion-referenced or a norm-referenced test is
to be developed. While this is probably more of a fond hope than a reality,
one is still inclined to agree that perhaps all test developers should take
such care in developing tests. It should be noted, however, that content
validity implies an indication of the sampling plan by which the particular
items that appear on a particular test form are selected from the domain
of all items (Cronbach, 1970).
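
As a rough illustration of what an explicit sampling plan might look like,
the following Python sketch draws the same number of items at random from
the item domain of each behavior class. The function name draw_test_form,
the dictionary layout, and the equal allocation per class are assumptions
made for illustration only; the text does not prescribe a particular plan.

```python
import random


def draw_test_form(item_domains, items_per_class, seed=None):
    """Draw a test form by simple random sampling within each behavior class.

    item_domains: dict mapping a behavior-class label to the list of items
        (the item domain) that operationalizes that class.
    items_per_class: number of items to draw from every class (an assumed,
        equal-allocation plan).
    seed: optional seed so that a form can be reproduced.
    """
    rng = random.Random(seed)
    form = {}
    for behavior_class, items in item_domains.items():
        if len(items) < items_per_class:
            raise ValueError(
                "item domain for %r has fewer than %d items"
                % (behavior_class, items_per_class)
            )
        # Sampling without replacement: no item appears twice on a form.
        form[behavior_class] = rng.sample(items, items_per_class)
    return form
```

Because the plan is stated explicitly (which domains, how many items, what
random mechanism), a reader of the test can judge how far a score
generalizes to the domain, which is the point of the content-validity
argument above.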

It is assumed here that empirical data and performance theory support
the definitions of achievement levels in the domain and the classes of
test tasks operationalizing these behavior classes. The task is to select
items to put on a form of the test in such a way that performance on
that test will be a basis for an inference about the examinee's performance
in the domain. It has already been mentioned that criterion-referenced
test score interpretation is more meaningful when the behavior domain
has an orderly progression. This implies taking advantage of the
psychological structure of the subject-matter domain in selecting test items.

Examples of Item Selection for Curriculum Placement

If an instructional system is adaptive, it will avoid teaching the student
that which he has already learned and will instead offer him new goals
to learn. Information is needed to answer the question, "Where in the
instructional sequence should the student begin his study?" Tests built
to provide this information are specific to the content and psychological
structure of the particular course of instruction with which the student
is faced.

In broad areas such as an entire course or an entire curriculum area,
neat hierarchies of the Gagné type covering the entire course of instruction
may not exist or may become very complicated. Nevertheless, some
sequencing of instructional objectives is possible. An illustration of this is
shown in Figure 1, in which an elementary school mathematics curriculum
has been defined in terms of approximately 350 instructional objectives.
The content has been broken down into ten topics which are roughly
in a prerequisite order (from top to bottom in the figure). Further, each
topic has been developed over a range of complex behaviors that are
also in a rough prerequisite order (from Level A through Level G in
the figure). Each cell of the grid represents several instructional objectives
and is called a unit of instruction. The objectives in a unit of instruction
can usually be arranged in a hierarchy that leads to a few terminal goals
for that unit. The inset shows (hypothetically) how a short sequence of






objectives might look for one unit of instruction. Within a single unit,
in general, there will be prerequisite behaviors from earlier topics and
lower levels. These are labeled as behaviors A, B, C and D in the inset.
One way to place a pupil in this curriculum is to develop a two-stage





[Figure 1: grid of Level of Complexity (columns) by content topic (rows:
Numeration/Place Value, Addition/Subtraction, Multiplication, Division,
Fractions, Money, Time, Systems of Measurement, Geometry, Applications),
with an inset showing a hypothetical sequence of objectives for one unit.]

*Indicates a unit of instruction consisting of one or more instructional objectives.

Figure 1. Example of Curriculum Layout for Individually Prescribed
Instruction Elementary Mathematics






[Figure 2: a completed "Mathematics Placement Profile" form with header
fields (including School, Teacher, and Room) and a grid of Placement Level
(columns) by topic (rows: Numeration/Place Value, Addition/Subtraction,
Multiplication, Division, Fractions, Money, Time, Systems of Measurement,
Geometry, Applications).]

Figure 2. Example of Placement Profile for a Hypothetical Student with
Respect to the Mathematics Curriculum of Individually Prescribed Instruction

placement test (Cox & Boston, 1967). The first-stage test is broad-ranged
over the curriculum. The results are used to place a student at a unit
in each topic or content area. The second-stage test is narrow-ranged
and tests the domain of behavior implied by a single unit. The results
are used to place a student at a particular objective within a unit. The
first-stage test needs to be administered only once at the beginning of
a course of study. After completing instruction on the first unit of study,
the student is given the second-stage test for the next sequential unit.
Thus, he is placed at each successive unit in the curriculum. Figure 2
shows a completed first-stage placement profile for a hypothetical student.
Figure 3 shows what a completed second-stage placement profile might
look like.

The broad-range test is actually a battery of tests consisting of one
test for each topic. Each subtest would predict for each topic the last
unit in the sequence from A to G in which the student would be successful.
Traditional item-selection procedures that seek to maximize predictive
validity would seem appropriate for this type of broad-range test. If the
behaviors defined within a unit are hierarchical, then one could select






Figure 3. Placement Profile for a Hypothetical Student (Shaded boxes mean
that the student has sufficient mastery of these instructional goals to proceed
with a new instructional goal.)

items from the domains that define the terminal objectives for that unit,
and depend on the prerequisite nature of the hierarchy to subsume the
other behaviors in the unit. If a within-unit hierarchy does not exist, then
selecting items from the domains of all the within-unit behaviors would
seem to be required. Care should be taken, however, in using correlational
indices for this type of prediction; it is the absolute level of attainment
of unit skills that is of prime importance.
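
The choice just described, between sampling only from terminal objectives
and sampling from every within-unit behavior, can be sketched in a few
lines of Python. The unit below (EXAMPLE_UNIT) is a made-up five-objective
hierarchy, not one of the IPI units, and the function names are
hypothetical; a terminal objective is taken to be one that is not a
prerequisite of any other objective in the unit.

```python
# Hypothetical unit: objective -> list of its direct prerequisites.
EXAMPLE_UNIT = {1: [], 2: [1], 3: [1], 4: [2, 3], 5: [3]}


def terminal_objectives(prerequisites):
    """Return the objectives that no other objective lists as a prerequisite."""
    used_as_prerequisite = {
        p for prereqs in prerequisites.values() for p in prereqs
    }
    return sorted(obj for obj in prerequisites if obj not in used_as_prerequisite)


def objectives_to_sample(prerequisites, is_hierarchical):
    """Choose the objectives whose item domains should supply test items.

    If the unit is hierarchical, only the terminal objectives are sampled and
    the prerequisite structure is relied on to subsume the rest; otherwise
    every objective in the unit is sampled.
    """
    if is_hierarchical:
        return terminal_objectives(prerequisites)
    return sorted(prerequisites)
```

For the hypothetical unit, only objectives 4 and 5 would be sampled when the
hierarchy is trusted, against all five objectives when it is not.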

The second-stage type of unit test serves as another example of how
items might be selected by taking advantage of the psychological structure
of the subject-matter content. If the unit behaviors are hierarchical and
domains of items are defined for each node in the hierarchy, then a
branched test can be used to obtain a pupil's profile with respect to this
hierarchy. Thus, if an examinee was successful on items testing one
objective in the hierarchy, this would indicate that items from earlier objectives
in the hierarchy would be passed as well.* Procedures for branched testing
initially proposed by Ferguson (1970) and further elaborated by Hsu
(Ferguson & Hsu, 1971; Hsu & Carlson, 1972) have been successfully used
in an elementary mathematics curriculum when coupled with item forms
and a computer.



*Such elaborate procedures would have to be balanced out against efficiency criteria.
For example, in small hierarchies consisting of a few nodes, a tailored test would be more
elaborate than necessary. A student might be placed more quickly and efficiently by simply
testing all nodes.
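
The inference that underlies branched testing, namely that success on one
objective implies success on every objective earlier in the hierarchy, can
be written as a short traversal of the prerequisite structure. The Python
sketch below uses a made-up five-objective unit; the name implied_mastered
and the dictionary representation are assumptions for illustration.

```python
# Hypothetical unit: objective -> list of its direct prerequisites.
EXAMPLE_UNIT = {1: [], 2: [1], 3: [1], 4: [2, 3], 5: [3]}


def implied_mastered(prerequisites, passed):
    """Return the passed objective together with all of its prerequisites.

    Walks the prerequisite links transitively, so prerequisites of
    prerequisites are included as well.
    """
    implied = set()
    stack = [passed]
    while stack:
        objective = stack.pop()
        if objective in implied:
            continue  # already visited through another branch
        implied.add(objective)
        stack.extend(prerequisites.get(objective, ()))
    return implied
```

In the hypothetical unit, passing objective 4 would credit objectives 1
through 4 without testing 1, 2, or 3, which is where the saving in testing
time comes from. The saving is only as good as the hierarchy is valid.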




Figure 4 is a schematic illustration of terminal and prerequisite
instructional objectives for an addition-subtraction unit from the elementary
arithmetic curriculum of the Individually Prescribed Instruction Project
(Lindvall & Bolvin, 1967). Each box represents one objective. The
objectives are arranged in a branched hierarchy. Objectives 6, 17, and 18 are
terminal objectives for the unit; the remaining objectives are prerequisites.
Each of these prerequisites and terminal objectives is defined by one or









Figure 4. An Example of a Hierarchy of Skills in an IPI Mathematics Unit






more item forms which are then programmed for use on the computer.
The testing is done on an individual basis at a computer terminal.

The object of the testing scheme is to locate a pupil at one of these
objectives or "boxes" as quickly as possible and in such a way that he
demonstrates mastery of objectives below his location and non-mastery
of objectives above his location. The decisions for which the testing
procedure must provide information are (1) what objectives should be tested
and (2) whether the pupil has mastery or non-mastery* of the objectives
that are tested. A decision needs to be made about every objective, but
the trick is to make these decisions without testing every objective, and
to minimize the testing for those objectives that are tested.

*By mastery is meant that ". . . an examinee makes a sufficient number of correct
responses on the sample of test items presented to him in order to support the generalization
(from the sample to the domain or universe of items implied by an instructional objective)
that he has attained the desired, pre-specified degree of proficiency with respect to the
domain" (Glaser & Nitko, 1971).

On this basis, a set of decision rules is devised that combines the
capabilities of the computer with statistical logic and subject-matter logic. This
allows "on-line" decisions to be made about what is to be tested and
how extensively it is to be tested. The procedure breaks away from the
traditional "test now, decide later" schemes that have received recent
criticism (e.g., Green, 1969).

[Figure 5: graph of the sequential decision regions for the hypotheses
H0: p ≥ .85 (Student has sufficient mastery, omit instruction)
H1: p ≤ .60 (Student does not have sufficient mastery, give instruction)]

Figure 5. Graph Illustrating Sequential Probability Ratio Test for
Determining Whether a Student Does or Does Not Need Instruction on an
Objective (Modified from Ferguson, 1970)

A decision about mastery of one objective can be made by using the
sequential probability ratio test (Wald, 1947). An example of the situation
is shown in Figure 5. The test length varies from pupil to pupil. A pupil
is given only as many randomly-selected test items as are necessary to
make a mastery or non-mastery decision with respect to a fixed mastery
criterion and with prespecified Type I and Type II error rates. After
each item is administered and scored, a decision is made to declare mastery,
continue testing, or to declare non-mastery. With the number of items
a random variable, it is possible, in this example, to make a mastery decision
with as few as 6 items and a non-mastery decision with as few as 2 items.
Not all mastery and non-mastery decisions are made this quickly; it depends
on the response pattern of the pupil.
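
A minimal Python sketch of this sequential decision follows, using Wald's
likelihood-ratio boundaries with the proportions .85 (mastery) and .60
(non-mastery) from Figure 5. The error rates are not stated in the text;
alpha = .20 (declaring non-mastery for a true master) and beta = .10
(declaring mastery for a true non-master) are assumed here, and the
function name sprt_mastery is hypothetical.

```python
import math


def sprt_mastery(responses, p_master=0.85, p_nonmaster=0.60,
                 alpha=0.20, beta=0.10):
    """Wald sequential probability ratio test for one objective.

    responses: iterable of booleans, True for a correct item response.
    alpha, beta: ASSUMED error rates (not given in the source text).
    Returns (decision, items_used), where decision is "mastery",
    "non-mastery", or "continue" if the responses ran out first.
    """
    # Boundaries on the log likelihood ratio of non-mastery to mastery.
    upper = math.log((1 - beta) / alpha)   # at or above: declare non-mastery
    lower = math.log(beta / (1 - alpha))   # at or below: declare mastery
    # Contribution of a single correct or incorrect response.
    step_correct = math.log(p_nonmaster / p_master)
    step_wrong = math.log((1 - p_nonmaster) / (1 - p_master))

    log_ratio = 0.0
    items_used = 0
    for items_used, correct in enumerate(responses, start=1):
        log_ratio += step_correct if correct else step_wrong
        if log_ratio <= lower:
            return "mastery", items_used
        if log_ratio >= upper:
            return "non-mastery", items_used
    return "continue", items_used
```

With these assumed error rates, six consecutive correct responses are the
shortest route to a mastery decision and two consecutive errors the shortest
route to a non-mastery decision, which matches the minimum test lengths
quoted above; other error rates would move the boundaries.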

Figure 5 illustrates the procedure for one objective. The problem that
remains is that a decision needs to be made about every objective. Since
the objectives are organized into a prerequisite sequence, the sequence
itself can be used in the decision-making process. This results in the
compound branching rule shown in Table 1 for determining the next

Table 1. Branching Rules for Computer-Assisted Placement Testing

Decision for    Pupil's Response    Branching Rules
Skill           Data (p)            (Next Skill to be Tested)

Mastery         HIGH                Branch up to highest untested skill.

                LOW                 Branch up to skill midway between this skill
                                    and highest untested skill.

Non-Mastery     HIGH                Branch down to skill midway between this skill
                                    and lowest untested skill.

                LOW                 Branch down to lowest untested skill.



objective to be tested. The "next objective to be tested" depends on whether
the student is declared a master or a non-master and on his response
pattern that led to this decision. This is illustrated by the arrows sketched
on Figure 6.

Testing begins at an objective in the middle of the hierarchy and
continues until the branching rule cannot be satisfied. At that point, the
objective tested is the proper location of the student in the hierarchy.












Figure 6. An Example of the Application of the Branching Rules of Table 1
to the IPI Mathematics Unit in One Instance

(Note: Only one of the "arrows" would be followed to locate the next objective
to be tested. The branching rules would be reapplied after testing the next
objective.)

Untested skills can be assumed mastered or unmastered according to their
position in the hierarchy and the student's response data.
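
The branching rules of Table 1 can be sketched in Python under a
simplifying assumption: the skills are treated as a single linear
prerequisite sequence (indices 0 to n-1) rather than the branched hierarchy
of Figure 4. The names next_skill and place_student are hypothetical, and
the caller-supplied function administer stands in for the sequential test of
one skill, returning the mastery decision and whether the response data were
HIGH; how HIGH and LOW are defined is not specified here.

```python
def next_skill(current, mastered, p_high, lowest_untested, highest_untested):
    """Apply the four branching rules to pick the next skill to test."""
    if mastered:
        if p_high:
            return highest_untested                   # branch up to the top
        return (current + highest_untested + 1) // 2  # midway up, rounded up
    if p_high:
        return (current + lowest_untested) // 2       # midway down, rounded down
    return lowest_untested                            # branch down to the bottom


def place_student(n_skills, administer):
    """Locate a student in a linear sequence of n_skills skills.

    administer(skill) -> (mastered, p_high), both booleans.
    Returns (location, tested): the index of the first unmastered skill
    (n_skills if every skill is mastered) and the skills actually tested.
    """
    lowest, highest = 0, n_skills - 1   # bounds of the still-untested skills
    current = n_skills // 2             # start in the middle of the sequence
    tested = []
    while True:
        mastered, p_high = administer(current)
        tested.append(current)
        if mastered:
            lowest = current + 1    # this skill and all below: assumed mastered
        else:
            highest = current - 1   # this skill and all above: assumed unmastered
        if lowest > highest:        # no untested skill left to branch to
            return lowest, tested
        current = next_skill(current, mastered, p_high, lowest, highest)
```

For a ten-skill sequence and a student who has mastered the first five
skills, the sketch reaches the placement after testing three skills when the
response data are always HIGH and five skills when they are always LOW.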

An individual's testing session results in a profile similar to the one
shown earlier in Figure 3. The student would begin his instruction in
this unit on the next sequential objective that was unmastered.

Elaborations on how items are selected and generated from item forms
by the computer are given elsewhere (Ferguson & Hsu, 1971; Hsu &


Carlson, 1972). Figure 7 is a flow chart that illustrates the item selection,
administration, scoring, and decision-making procedures in the testing
situation. It should be noted that this type of criterion-referenced
branched testing is still in the developmental stage and that evidence
concerning its appropriateness needs to be provided before it can be
strongly recommended.



[Figure 7: flow chart whose legible boxes include "Objective to be Tested,"
"Testing Manager," and "Item Administrator," with a YES/NO decision branch.]

Figure 7. Execution Model for Pretests and Posttests Using Item-Cluster
Generators (Adapted from Ferguson & Hsu, 1971)






CRITERION-REFERENCED TEST SCORES

Criterion-referenced test scores lead to an inference about the
performance characteristics of the examinee. Such scores indicate the behaviors
the examinee can exhibit with respect to a defined domain of behaviors.
These scores are derived scores in the sense that their interpretation is
based on the psychological structure underlying the behavior domain.

In the examples illustrated in Figures 2 and 3, the unit of instruction
and the node in the hierarchy are defined by classes of behaviors. A
particular score on the geometry subtest, for example, might mean that
the examinee can perform all lower-level behaviors up to and including:
identifying pictures of open continuous curves, lines, line segments, and
rays; stating how these are related to each other; writing symbolic names
for specific illustrations of them; identifying pictures of intersecting and
non-intersecting lines; and naming points of intersection. The score would
also mean that the examinee could not demonstrate higher-level behaviors.

Scores may also be related to expectancy tables, thus indicating the
probabilities associated with various score-behavior class performance
combinations (Cronbach, 1970). This would combine norm-group data with
performance data and aid in the overall interpretation of performance
not tested. For example, relating acquired levels of performance to chances
of being successful in new instructional situations broadens the
interpretation of criterion-referenced scores. Obviously, norm-referenced scores
such as percentile ranks, standard scores, grade equivalents, and so on
can be obtained from criterion-referenced tests as well.
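
A simple expectancy table of the kind mentioned above can be tabulated
from norm-group records that pair each student's placement level with
whether he later succeeded in a new instructional situation. The Python
sketch below is illustrative only: the function name expectancy_table and
the record layout are assumptions, and any data passed to it would have to
come from an actual norm group.

```python
from collections import defaultdict


def expectancy_table(records):
    """Tabulate the proportion of successes at each placement level.

    records: iterable of (placement_level, succeeded) pairs, where
        succeeded is a boolean outcome observed for one norm-group student.
    Returns a dict mapping each level to the observed proportion succeeding.
    """
    counts = defaultdict(lambda: [0, 0])   # level -> [successes, total]
    for level, succeeded in records:
        if succeeded:
            counts[level][0] += 1
        counts[level][1] += 1
    return {
        level: successes / total
        for level, (successes, total) in sorted(counts.items())
    }
```

Each entry then reads as an empirical estimate of the chance of success for
students placed at that level, which is the sense in which norm-group data
broaden the interpretation of a criterion-referenced score.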

An issue often closely associated with criterion-referenced testing is that
of mastery learning and mastery testing. A full discussion of mastery testing
is beyond the scope of this paper. The reader is referred to papers by
Bloom, Hastings, and Madaus (1971), Block (1972), Bormuth (1971), Ebel
(1970), and Glaser and Nitko (1971), for some discussion of this problem
as it relates to testing. It is noted here that a criterion-referenced test
does not necessarily imply flawless performance nor that any examinee
necessarily meet a given standard of competence. What is implied,
however, is the notion that such levels of competency be defined in terms
of performance (Nitko, 1970).

INSTRUCTIONAL SYSTEMS AND TESTING 

It is important to point out that the kinds of tests that are developed
and used will depend on the decision framework within which the
test-provided information is employed (e.g., Cronbach & Gleser, 1965). It has
been indicated that criterion-referenced tests will probably find their
greatest use in instructional situations. Since there are a variety of ways
in which instructional systems can be designed and operated to adapt
to individual differences (Cronbach, 1967), the design of testing programs
needs to take the instructional system into account. This means that various






mixtures of criterion-referenced and norm-referenced test varieties will
be needed depending on the particular instructional system. Thus, in the
overall planning and designing of a testing program, decisions about when
(and whether) criterion-referenced tests are to be used need to be made.

One example of how criterion-referenced and other types of test
information can be designed into a particular kind of individualized instructional
system has been given by Glaser and Nitko (1971). The discussion there
indicates how the various kinds of instructional decisions that need to
be made are determined as well as the kinds of tests that need to be
developed to provide this kind of information. Similar analyses of other
types of instructional systems need to be made and testing programs need
to be developed in the context of these analyses.

SUMMARY 

This paper has reviewed the requirements for the construction of
criterion-referenced tests that would be used in instructional situations. It has
tried to indicate the problems faced in the practical construction of such
tests and some of the techniques that have been found to be of some
value in solving these problems. Adequate solutions do not exist for all
of the problems raised. In particular, procedures are needed for the solution
of the following problems:

1. Defining the behaviors to be taught and tested for in the instructional
situation.

2. Task analysis as it relates to school-like behaviors.

3. Relationship between what is tested and the ultimate objectives
of the individual and society.

4. The relationship between the behavioral domain and the domain
of tasks serving as the potential item domain.

5. Specification of the domain of tasks in terms of their stimulus and
response characteristics.

6. The ordering of the domain of behaviors in terms of their
psychological structure.

7. Data related to the generalizability of samples of behavior to the
behavioral domain.

8. Construct validation of proposed orderings of the behavioral domain.

9. The development of an item-writing theory and an item-response
theory.

10. Development of procedures for determining mastery of identified
behavior.

While solutions to the above problems would lead to improved
criterion-referenced test construction practices, it should not be assumed that
criterion-referenced information is all that is needed to make instructional




decisions. Without an analysis of the kinds of instructional decisions that
need to be made in a given instructional situation, discussions about tests,
testing procedures, and test development tend to be fruitless.

REFERENCES 

Baker, E.L. The effects of manipulated item writing constraints on the
homogeneity of test items. Journal of Educational Measurement, 1971,
8, 305-309.

Block, J.H. Student evaluation: Toward the setting of rational,
criterion-referenced performance standards. A paper presented at the
annual meeting of the American Educational Research Association,
Chicago, 1972.

Bloom, B.S., et al. (Eds.) Taxonomy of educational objectives, Handbook
I: Cognitive domain. New York: David McKay, 1956.

Bloom, B.S., Hastings, J.T., & Madaus, G.F. Handbook on formative and
summative evaluation of student learning. New York: McGraw-Hill,
1971.

Bormuth, J.R. On the theory of achievement test items. Chicago: University
of Chicago Press, 1970.

Bormuth, J.R. Development of standards of readability: Toward a rational
criterion of passage performance. Final Report, USDHEW, Project
No. 9-0237. Chicago: The University of Chicago, 1971.

Bruner, J.S. Toward a theory of instruction. Cambridge, Mass.: The Belknap
Press of Harvard University Press, 1966.

Cox, R.C., & Boston, M.E. Diagnosis of pupil achievement in the
Individually Prescribed Instruction Project. Working Paper 15. Pittsburgh,
Pa.: University of Pittsburgh, Learning Research and Development
Center, 1967.

Cronbach, L.J. How can instruction be adapted to individual differences?
In R.M. Gagné (Ed.), Learning and individual differences. Columbus,
Ohio: Charles E. Merrill, 1967.

Cronbach, L.J. Essentials of psychological testing. (3rd ed.) New York:
Harper and Row, 1970.

Cronbach, L.J. Test validation. In R.L. Thorndike (Ed.), Educational
measurement. (2nd ed.) Washington: American Council on Education,
1971.

Cronbach, L.J., & Gleser, G.C. Psychological tests and personnel decisions.
Urbana: University of Illinois Press, 1965.

Davis, F.B. Criterion-referenced tests. In Testing in turmoil: A conference
on problems and issues in educational measurement. Greenwich,
Conn.: Educational Records Bureau, 1970.




Ebel, R.L. Content standard test scores. Educational and Psychological
Measurement, 1962, 22, 15-25.

Ebel, R.L. Some limitations of criterion-referenced measurement. In
Testing in turmoil: A conference on problems and issues in educational
measurement. Greenwich, Conn.: Educational Records Bureau, 1970.

Ferguson, R.L. A model for computer-assisted criterion-referenced
measurement. Education, 1970, 81, 25-31.

Ferguson, R.L., & Hsu, T.C. The application of item generators for
individualizing mathematics testing and instruction. Publication 1971/14.
Pittsburgh, Pa.: University of Pittsburgh, Learning Research and
Development Center, 1971.

Flanagan, J.C. Units, scores, and norms. In E.F. Lindquist (Ed.), Educational
measurement. Washington: American Council on Education, 1951.

Gagné, R.M., & Paradise, N.E. Abilities and learning sets in knowledge
acquisition. Psychological Monographs, 1961, 75.

Glaser, R. Instructional technology and the measurement of learning
outcomes. American Psychologist, 1963, 18, 519-521.

Glaser, R., & Nitko, A.J. Measurement in learning and instruction. In R.L.
Thorndike (Ed.), Educational measurement. (2nd ed.) Washington:
American Council on Education, 1971, 625-670.

Glaser, R., & Resnick, L.B. Instructional psychology. Annual Review of
Psychology, 1972, 23, 207-276.

Green, B.F. Comments on tailored testing. In W. Holtzman (Ed.),
Computer-assisted instruction, testing, and guidance. New York: Harper
and Row, 1969.

Harris, M.L., & Stewart, D.M. Application of classical strategies to
criterion-referenced test construction: An example. A paper presented at
the annual meeting of the American Educational Research Association,
New York, 1971.

Hively, W. Preparation of a programmed course in algebra for secondary
school teachers: A report to the National Science Foundation.
Minnesota State Department of Education, Minnesota National
Laboratory, 1966.

Hively, W. Domain-referenced achievement testing. A paper presented
at the annual meeting of the American Educational Research
Association, Minneapolis, 1970.

Hively, W., Maxwell, G., Rabehl, G., Sension, D., & Lundin, S.
Domain-referenced curriculum evaluation: A technical handbook and a case
study from the MINNEMAST Project. CSE Monograph Series in
Evaluation, No. 1. Los Angeles: Center for the Study of Evaluation,
University of California, 1973.




Hively, W., Patterson, H.L., & Page, S. A "universe-defined" system of
arithmetic achievement tests. Journal of Educational Measurement,
1968, 5, 275-290.

Hsu, T.C., & Carlson, M. Oakleaf school project: Computer-assisted
achievement testing. Technical Report. Pittsburgh, Pa.: University
of Pittsburgh, Learning Research and Development Center, February,
1972.

Jackson, R. Developing criterion-referenced tests. ERIC/TM Report 1.
Princeton, N.J.: ERIC Clearinghouse on Tests, Measurement, and
Evaluation, 1971.

Lindquist, E.F. (Ed.) Educational measurement. Washington: American
Council on Education, 1951.

Lindvall, C.M., & Bolvin, J.O. Programmed instruction in the schools:
An application of programming principles in "Individually Prescribed
Instruction." In P. Lange (Ed.), Programmed Instruction, 66th
Yearbook, Part II. Chicago: National Society for the Study of Education,
1967, 217-254.

Livingston, S.A. Criterion-referenced applications of classical test theory.
Journal of Educational Measurement, 1972, 9, 13-26.

Lord, F.M. Some relations between Guttman's principal components of
scale analysis and other psychometric theory. Psychometrika, 1958,
23, 291-296.

Miller, R.B. Task description and analysis. In R.M. Gagné (Ed.),
Psychological principles in system development. New York: Holt, Rinehart,
and Winston, 1962.

Nitko, A.J. Criterion-referenced testing in the context of instruction. In
Testing in turmoil: A conference on problems and issues in educational
measurement. Greenwich, Conn.: Educational Records Bureau, 1970.

Osburn, H.G. Item sampling for achievement testing. Educational and
Psychological Measurement, 1968, 28, 95-104.

Resnick, L.B., Wang, M.C., & Kaplan, J. Behavioral analysis in curriculum
design: A hierarchically sequenced introductory mathematics
curriculum. Monograph 2. Pittsburgh, Pa.: University of Pittsburgh, Learning
Research and Development Center, December, 1970.

Simon, G.B. Comments on "Implications of criterion-referenced
measurement." Journal of Educational Measurement, 1969, 6, 259-260.

Stanley, J.C., & Hopkins, K.D. Educational and psychological measurement
and evaluation. Englewood Cliffs, N.J.: Prentice-Hall, 1972.

Stenner, A.J., & Webster, W.J. Educational program audit handbook.
Arlington, Va.: The Institute for the Development of Educational
Auditing, 1971.




Wald, A. Sequential analysis. New York: Wiley, 1947.

Wang, M.C., Resnick, L.B., & Boozer, R.F. The sequence of development
of some early mathematics behaviors. Publication 1971/6. Pittsburgh,
Pa.: University of Pittsburgh, Learning Research and Development
Center, 1971.






